The goal of voice conversion is to modify a source speaker's speech so that it sounds as if it were spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often produce muffled synthesized output because of excessive smoothing of the spectral envelopes.

Mobile applications are characterized by limited resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data remains a major challenge, as a typical mobile user is willing to record only a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low-resource environments, which is successfully trained using very few sentences (5–10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated using a weighted sum of the target training vectors, which serve as grid points. The training process consists of simple computations of Euclidean distances between the training vectors and is easily performed even with very small training sets.

We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method yields converted sentences with GV values closer to those of the target and, at the same time, lower spectral distances than the enhanced version of the GMM-based conversion method. Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in perceived quality.
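To make the grid-based idea concrete, the following is a minimal illustrative sketch of a generic grid-based Bayesian filter in which each converted vector is a weighted sum of target training vectors. It is not the formulation used in this paper: the Gaussian likelihood and transition kernels, the bandwidths `sigma_obs` and `sigma_tr`, the assumption of time-aligned source/target training pairs, and all function names are assumptions introduced only for illustration.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a (M, D) and rows of b (N, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def grid_convert(src_frames, src_train, tgt_train, sigma_obs=1.0, sigma_tr=1.0):
    """Track target vectors on a grid of target training vectors (hypothetical sketch).

    src_frames: (T, D) source MFCC frames to convert.
    src_train, tgt_train: (N, D) training pairs, assumed time-aligned.
    """
    n = len(tgt_train)

    # Assumed transition model: moving between nearby grid points is more likely.
    trans = np.exp(-sq_dists(tgt_train, tgt_train) / (2.0 * sigma_tr ** 2))
    trans /= trans.sum(axis=0, keepdims=True)  # column j holds p(i | j)

    # Assumed observation model: likelihood of each frame under each grid point,
    # measured through the paired source training vector.
    lik_all = np.exp(-sq_dists(src_frames, src_train) / (2.0 * sigma_obs ** 2))

    weights = np.full(n, 1.0 / n)  # uniform prior over grid points
    out = np.empty((len(src_frames), tgt_train.shape[1]))
    for t, lik in enumerate(lik_all):
        pred = trans @ weights          # prediction step
        weights = lik * pred + 1e-300   # update step (offset guards against underflow)
        weights /= weights.sum()
        out[t] = weights @ tgt_train    # weighted sum of target grid points
    return out
```

Because the weights are non-negative and sum to one, every converted vector is a convex combination of the target training vectors. This also hints at why such estimates can be over-smoothed and why a variance enhancement step may help.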